This website serves as a course diary for the University of Helsinki course Introduction to Open Data Science (IODS).
The script used for data wrangling is here
source("helper_functions.R")
library(knitr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(tidyr)
library(dplyr, warn.conflicts = F)
library(stringr)
analysis <- read.csv("data/learning2014.csv") %>% tbl_df()
glimpse(analysis)
## Observations: 166
## Variables: 7
## $ gender <fctr> F, M, F, M, M, F, M, F, M, F, M, F, F, F, M, F, F, F...
## $ Age <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 3...
## $ Attitude <int> 37, 31, 25, 35, 37, 38, 35, 29, 38, 21, 39, 38, 24, 3...
## $ deep <dbl> 3.636364, 3.000000, 3.045455, 3.500000, 3.681818, 4.3...
## $ stra <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.00...
## $ surf <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.4...
## $ Points <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 2...
The data contains the learning outcomes (exam points) of 166 students along with possibly associated variables: attitude, age, and gender. Also included are the students’ Likert scores on the following dimensions of learning: deep, surface/superficial, and strategic.
Variables starting with ST, SU, and D represent the strategic, superficial, and deep dimensions, respectively.
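In the wrangling script these dimensions were built by averaging the corresponding question items. A minimal sketch of the idea, using a hypothetical two-student, three-item subset of the deep-learning questions:

```r
# Hypothetical mini-example: average three deep-learning items (D03, D06, D07)
# into a single 'deep' score per student, as done in the wrangling script
raw  <- data.frame(D03 = c(4, 2), D06 = c(5, 3), D07 = c(4, 4))
deep <- rowMeans(raw[, c("D03", "D06", "D07")])
deep  # one averaged Likert score per student
```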
info <- read.delim("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt",
header = F, sep = "\t", encoding = "latin1")[12:71,] %>%
str_split(boundary(type = "word"), n = 2, simplify = T) %>% as.data.frame()
names(info) <- c("Variable", "Description")
info %>%
kable("html", align = "rrr", caption = "Data variable info") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(height = "200px")
| Variable | Description |
|---|---|
| Aa | Making sure you remember things well |
| Ab | Developing as a person |
| Ac | Building up knowledge by acquiring facts and information |
| Ad | Being able to use the information you’ve acquired |
| Ae | Understanding new material for yourself |
| Af | Seeing things in a different and more meaningful way |
| ST01 | I manage to find conditions for studying which allow me to get on with my work easily. |
| SU02 | Often I find myself wondering whether the work I’m doing here is really worthwhile. |
| D03 | I usually set out to understand for myself the meaning of what we have to learn. |
| ST04 | I organize my study time carefully to make the best use of it. |
| SU05 | I find I have to concentrate on just memorising a good deal of what I have to learn. |
| D06 | I look at the evidence carefully and try to reach my own conclusion about what I’m studying. |
| D07 | I try to relate ideas I come across to those in other topics or other course whenever possible. |
| SU08 | I tend to read very little beyond what is actually required to pass. |
| ST09 | I think I’m quite systematic and organised when it comes to revising for exams. |
| SU10 | There’s not much of the work here that I find interesting or relevant. |
| D11 | When I’m reading an article or book, I try to find out for myself exactly what the author means. |
| ST12 | I’m pretty good at getting down to work whenever I need to. |
| SU13 | Much of what I’m studying makes little sense: it’s like unrelated bits and pieces. |
| D14 | When I’m working on a new topic, I try to see in my own mind how all the ideas fit together. |
| D15 | Often I find myself questioning things I hear in lectures or read in books. |
| SU16 | I concentrate on learning just those bits of information I have to know to pass. |
| ST17 | I’m good at following up some of the reading suggested by lecturers or tutors. |
| SU18 | When I look back, I sometimes wonder why I ever decided to come here. |
| D19 | When I’m reading I stop from time to time to reflect on what I am trying to learn from it. |
| ST20 | I work steadily through the term or semester, rather than leave it all until the last minute. |
| SU21 | I’m not really sure what’s important in lectures, so I try to get down all I can. |
| D22 | Ideas in course books or articles often set me off on long chains of thought of my own. |
| D23 | When I read, I examine the details carefully to see how they fit in with what’s being said. |
| SU24 | I gear my studying closely to just what seems to be required for assignments and exams. |
| ST25 | I usually plan out my week’s work in advance, either on paper or in my head. |
| SU26 | I’m not really interested in this course, but I have to take it for other reasons. |
| D27 | Before tackling a problem or assignment, I first try to work out what lies behind it. |
| ST28 | I generally make good use of my time during the day. |
| SU29 | I often have trouble in making sense of the things I have to remember. |
| D30 | I like to play around with ideas of my own even if they don’t get me very far. |
| D31 | It’s important for me to be able to follow the argument, or to see the reason behind things. |
| SU32 | I like to be told precisely what to do in essays or other assignments. |
| Ca | Lecturers who tell us exactly what to put down in our notes. |
| Cb | Lecturers who encourage us to think for ourselves and show us how they themselves think. |
| Cc | Exams which allow me to show that I’ve thought about the course material for myself. |
| Cd | Exams or tests which need only the material provided in our lecture notes. |
| Ce | Courses in which it’s made very clear just which books we have to read. |
| Cf | Courses where we’re encouraged to read around the subject a lot for ourselves. |
| Cg | Books which challenge you and provide explanations which go beyond the lectures. |
| Ch | Books which give you definite facts and information which can easily be learned. |
| Da | I feel confident I can master statistics topics |
| Db | Statistics will be useful for my future work |
| Dc | I am interested in understanding statistics |
| Dd | I did well in high school mathematics courses |
| De | I am interested in learning statistics |
| Df | I feel insecure when I have to do statistics problems |
| Dg | I like statistics |
| Dh | I am scared by statistics |
| Di | I feel confident in doing math operations |
| Dj | Statistics will be useful for my future studies |
| Age | Age (in years) derived from the date of birth |
| Attitude | Global attitude toward statistics |
| Points | Exam points |
| gender | Gender: M (Male), F (Female) |
ggpairs(analysis,
title = "Study variable overview",
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = .2, size = .6),
combo = wrap("facethist", bins = 10))) +
theme(
axis.text.x = element_text(angle = 90, color = "black", size = 7, vjust = .5),
axis.text.y = element_text(color = "black", size = 7))
summaryKable(analysis) %>%
kable("html", align = "rrr", caption = "Data variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(width = "100%")
| | gender | Age | Attitude | deep | stra | surf | Points |
|---|---|---|---|---|---|---|---|
| Min | Levels | 17.000 | 14.000 | 1.818 | 1.250 | 1.583 | 7.000 |
| 1st Q | F: 110 | 21.000 | 26.000 | 3.091 | 2.625 | 2.417 | 19.000 |
| Median | M: 56 | 22.000 | 32.000 | 3.455 | 3.188 | 2.833 | 23.000 |
| Mean | – | 25.512 | 31.428 | 3.436 | 3.121 | 2.787 | 22.717 |
| 3rd Q | – | 27.000 | 37.000 | 3.727 | 3.625 | 3.167 | 27.750 |
| Max | – | 55.000 | 50.000 | 4.455 | 5.000 | 4.333 | 33.000 |
All variables seem to be approximately normally distributed. There are considerably more female subjects than male (110 vs. 56). Judging by the histograms above, there are no great differences between the genders, except perhaps for the attitude variable. Points, our dependent variable, has a notch on the left tail but still follows normality reasonably well. For age, a logarithmic transformation might improve normality.
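As a sketch of why a log transform can help, here is an illustrative right-skewed age vector (made up for this example); the gap between mean and median, a crude indicator of right skew, shrinks on the log scale:

```r
# Made-up right-skewed ages; mean - median is a crude indicator of right skew
age <- c(17, 19, 20, 21, 21, 22, 22, 23, 25, 27, 30, 35, 42, 55)
skew_raw <- mean(age) - median(age)            # clearly positive
skew_log <- mean(log(age)) - median(log(age))  # much closer to zero
c(raw = skew_raw, log = skew_log)
```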
I chose Attitude, deep, and stra as explanatory variables for Points, as they were the most strongly correlated with it in the exploratory analysis.
model <- lm(Points ~ Attitude + deep + stra, data = analysis)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude + deep + stra, data = analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.39145 3.40775 3.343 0.00103 **
## Attitude 0.41497 0.08882 4.672 6.24e-06 ***
## deep -1.37353 1.37621 -0.998 0.31974
## stra 0.96208 0.53668 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
In this model only Attitude significantly predicts Points, so the other two variables are excluded. The high p-value for deep learning probably arises from the fact that Attitude and deep learning are highly correlated (r = 0.8).
model <- lm(Points ~ Attitude, data = analysis)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude, data = analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
So a student’s attitude seems to correlate with how many points they will score in the test; a better attitude predicts better scores. The R-squared value describes what proportion of the variance of the dependent variable (Points) is explained by the model. In this case only about 18.5% of the variability in Points is explained by the proposed model.
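The R-squared figure can be reproduced by hand from the residual and total sums of squares; a quick check on the built-in cars data:

```r
# R^2 = 1 - RSS/TSS: the proportion of response variance the model explains
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)        # TRUE
```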
par(mfrow = c(2,2), oma = c(0, 0, 2, 0), mar = c(2.5,3,2,0.5), mgp = c(1.5,.5,0))
plot(model, which = c(1,2), add.smooth = T)
norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun; the next line can be used to check the result
# sum(norm.res != rstandard(model))
aa <- analysis$Attitude
leverage <- (aa-mean(aa))^2/sum((aa-mean(aa))^2)+1/length(aa)
plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")
According to the diagnostic plots, this model has no critical problems:
1. Residuals are the differences between fitted and actual values. The residuals-vs-fitted plot shows no clustering or other patterns that could indicate problems in the model; the variance of the errors is constant.
2. On the Q-Q plot the model loses some accuracy at the extremes: it overestimates the performance of students with a very positive or very negative attitude. The Shapiro-Wilk test rejects normality (`shapiro.test(rstandard(model))$p.value` gives p = 0.0032512), but the errors in this model are distributed normally enough for our purposes.
3. Leverage describes how unusual the predictor values are. For a single variable it tells how extreme an observation is, i.e. how far it lies from the variable mean. For a multivariate model, these “hat values” capture combined unusualness across variables: an observation may be close to the mean of every single variable while the combination of its values is still unusual. In this model, no observations have high leverage.
4. Cook’s distance tells how much the fitted values would change if the observation in question were deleted; it can be used to diagnose and remove influential outliers. Here no single observation affects the model too much.
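Base R also ships accessors for all the quantities computed above; a sketch verifying the manual standardized-residual formula against `rstandard()` on the built-in cars data:

```r
fit <- lm(dist ~ speed, data = cars)
# Manual standardized residuals: residual / (sigma * sqrt(1 - leverage))
sigma_hat <- sqrt(deviance(fit) / df.residual(fit))
std_res   <- residuals(fit) / (sigma_hat * sqrt(1 - hatvalues(fit)))
all.equal(std_res, rstandard(fit))  # TRUE: matches the built-in accessor
head(cooks.distance(fit))           # per-observation influence on the fit
```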
The code used for the data wrangling exercise can be found here
library(knitr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(tidyr)
library(stringr)
library(dplyr, warn.conflicts = F)
source("helper_functions.R")
library(pROC)
df <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", head = T) %>% tbl_df()
summaryKable(df) %>%
kable("html", align = "rrr", caption = "Data variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(width = "100%")
| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | nursery | internet | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | higher | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | alc_use | high_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Levels | Levels | 15.000 | Levels | Levels | Levels | 0.000 | 0.000 | Levels | Levels | Levels | Levels | Levels | Levels | 1.000 | 1.000 | 0.000 | Levels | Levels | Levels | Levels | Levels | Levels | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 3.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| 1st Q | GP: 342 | F: 198 | 16.000 | R: 81 | GT3: 278 | A: 38 | 2.000 | 2.000 | at_home: 53 | at_home: 16 | course: 140 | no: 72 | no: 58 | father: 91 | 1.000 | 1.000 | 0.000 | no: 331 | no: 144 | no: 205 | no: 181 | no: 18 | no: 261 | 4.000 | 3.000 | 2.000 | 1.000 | 1.000 | 3.000 | 0.000 | 8.000 | 8.250 | 8.000 | 1.000 | 0.000 |
| Median | MS: 40 | M: 184 | 17.000 | U: 301 | LE3: 104 | T: 344 | 3.000 | 3.000 | health: 33 | health: 17 | home: 110 | yes: 310 | yes: 324 | mother: 275 | 1.000 | 2.000 | 0.000 | yes: 51 | yes: 238 | yes: 177 | yes: 201 | yes: 364 | yes: 121 | 4.000 | 3.000 | 3.000 | 1.000 | 2.000 | 4.000 | 3.000 | 10.500 | 11.000 | 11.000 | 1.500 | 0.000 |
| Mean | – | – | 16.586 | – | – | – | 2.806 | 2.565 | other: 138 | other: 211 | other: 34 | – | – | other: 16 | 1.442 | 2.034 | 0.291 | – | – | – | – | – | – | 3.940 | 3.223 | 3.113 | 1.474 | 2.280 | 3.579 | 5.319 | 10.861 | 10.712 | 10.387 | 1.877 | 0.293 |
| 3rd Q | – | – | 17.000 | – | – | – | 4.000 | 4.000 | services: 96 | services: 107 | reputation: 98 | – | – | – | 2.000 | 2.000 | 0.000 | – | – | – | – | – | – | 5.000 | 4.000 | 4.000 | 2.000 | 3.000 | 5.000 | 8.000 | 13.000 | 13.000 | 14.000 | 2.500 | 1.000 |
| Max | – | – | 22.000 | – | – | – | 4.000 | 4.000 | teacher: 62 | teacher: 31 | – | – | – | – | 4.000 | 4.000 | 3.000 | – | – | – | – | – | – | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 5.000 | 75.000 | 19.000 | 19.000 | 20.000 | 5.000 | 1.000 |
So, the data explores the school performance of Portuguese students. The set contains information on 382 students: their grades and several variables that might correlate with performance.
I hypothesized that good family relations and extra-curricular activities would protect from high alcohol use, while a higher number of absences and going out with friends more often would increase it.
mod <- df %>% select(high_use, famrel, activities, absences, goout)
mod
## # A tibble: 382 x 5
## high_use famrel activities absences goout
## <lgl> <int> <fctr> <int> <int>
## 1 FALSE 4 no 6 4
## 2 FALSE 5 no 4 3
## 3 TRUE 4 no 10 2
## 4 FALSE 3 yes 2 2
## 5 FALSE 4 no 4 2
## 6 FALSE 5 yes 10 2
## 7 FALSE 4 no 0 4
## 8 FALSE 4 no 6 4
## 9 FALSE 4 no 0 2
## 10 FALSE 5 yes 0 1
## # ... with 372 more rows
mod %>% select(famrel:goout) %>%
sapply(., function(variable_value) table(mod$high_use, variable_value))
## $famrel
## variable_value
## 1 2 3 4 5
## FALSE 7 9 40 131 83
## TRUE 2 9 26 52 23
##
## $activities
## variable_value
## no yes
## FALSE 122 148
## TRUE 59 53
##
## $absences
## variable_value
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## FALSE 94 2 54 2 36 4 19 6 13 3 12 1 6 0 7 1 2 0 2 0 1 1
## TRUE 22 1 13 4 15 0 11 2 8 0 5 1 4 3 4 2 4 1 2 1 1 0
## variable_value
## 22 23 24 25 26 28 30 54 56 75
## FALSE 0 1 0 1 1 0 0 0 0 1
## TRUE 3 0 1 0 0 1 1 1 1 0
##
## $goout
## variable_value
## 1 2 3 4 5
## FALSE 21 84 100 43 22
## TRUE 3 15 23 39 32
p1 <- ggplot(mod, aes(high_use, famrel)) + geom_boxplot()
p2 <- ggplot(mod, aes(high_use, fill = activities)) + geom_bar()
p3 <- ggplot(mod, aes(high_use, absences)) + geom_boxplot()
p4 <- ggplot(mod, aes(high_use, goout)) + geom_boxplot()
multiplot(p1,p2,p3,p4, cols = 2)
These values are well in line with the proposed hypothesis, except for activities: extra-curricular activities do not seem to protect from high alcohol use.
m <- glm(high_use ~ famrel + activities + absences + goout, data = df, family = binomial)
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + activities + absences + goout,
## family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9971 -0.7713 -0.5066 0.8710 2.4369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.16563 0.63289 -3.422 0.000622 ***
## famrel -0.39022 0.13579 -2.874 0.004059 **
## activitiesyes -0.42269 0.25302 -1.671 0.094804 .
## absences 0.05834 0.01630 3.579 0.000345 ***
## goout 0.81149 0.12286 6.605 3.98e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 385.49 on 377 degrees of freedom
## AIC: 395.49
##
## Number of Fisher Scoring iterations: 4
Based on the logistic regression model shown above, the hypothesis seems to hold for all proposed factors, although the effect of extra-curricular activities falls short of statistical significance (p ≈ 0.095).
coef(m)
## (Intercept) famrel activitiesyes absences goout
## -2.16563017 -0.39021678 -0.42268545 0.05833905 0.81149225
OR <- coef(m) %>% exp()
CI <- suppressMessages(confint(m)) %>% exp()
cbind(OR, CI) %>% round(3)
## OR 2.5 % 97.5 %
## (Intercept) 0.115 0.032 0.387
## famrel 0.677 0.517 0.882
## activitiesyes 0.655 0.397 1.073
## absences 1.060 1.029 1.098
## goout 2.251 1.782 2.887
A one-step improvement in family relationships is associated with an odds ratio of 0.677, meaning the risk of high alcohol use is reduced. A growing number of absences, going out more often, and a lack of extra-curricular activities are all associated with higher odds of high alcohol use.
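As a reminder of where these odds ratios come from, exponentiating a logistic-regression coefficient gives the multiplicative change in odds for a one-unit increase in the predictor; a sketch on the built-in mtcars data:

```r
# Logistic regression of transmission type on weight; exp(coef) is the
# odds ratio per one-unit (1000 lb) increase in weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
or_wt <- exp(coef(fit)["wt"])
or_wt < 1  # TRUE: heavier cars have lower odds of a manual transmission
```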
df <- mutate(df, prob = predict(m, type = "response"))
plot.roc(df$high_use, df$prob)
df <- mutate(df, pred = prob > .4)
table(high_use = df$high_use, prediction = df$pred)
## prediction
## high_use FALSE TRUE
## FALSE 229 41
## TRUE 47 65
table(high_use = df$high_use, prediction = df$pred) %>%
prop.table() %>% `*`(100) %>% round(2) %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 59.95 10.73 70.68
## TRUE 12.30 17.02 29.32
## Sum 72.25 27.75 100.00
Based on the receiver operating characteristic (ROC) curve, the classification threshold should be a little over .3 for optimal accuracy. With a threshold of .4, roughly 77% of the predictions are correct. This is better than what you would expect from simple coin tossing, so there is some benefit in using our model.
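One way to locate a good cutoff is a brute-force scan of the error rate over candidate thresholds; a sketch on synthetic class/probability vectors (standing in for `df$high_use` and `df$prob`):

```r
# Synthetic outcomes and noisy fitted probabilities for illustration
set.seed(42)
class <- rbinom(200, 1, 0.3)
prob  <- plogis(qlogis(0.3) + 2 * class + rnorm(200))
thresholds <- seq(0.1, 0.9, by = 0.1)
# Misclassification rate at each candidate threshold
err <- sapply(thresholds, function(t) mean((prob > t) != (class == 1)))
thresholds[which.min(err)]  # cutoff with the lowest error rate
```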
library(boot)
loss_func <- function(class, prob) {
# The .1 shift moves the implicit .5 cutoff down to a threshold of .4
n_wrong <- abs(class - prob - .1) > .5
mean(n_wrong)
}
loss_func(df$high_use, df$prob)
## [1] 0.2303665
# A simpler way to compute the same error rate
1-(mean(df$high_use == df$pred))
## [1] 0.2303665
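The `.1` shift in `loss_func` is what moves the implicit cutoff from .5 to .4: for 0/1 classes, `abs(class - prob - (0.5 - t)) > 0.5` flags exactly the observations misclassified at threshold `t`. A small check:

```r
t <- 0.4
class <- c(0, 0, 1, 1)
prob  <- c(0.2, 0.6, 0.3, 0.9)
wrong_shift  <- abs(class - prob - (0.5 - t)) > 0.5  # the loss_func trick
wrong_direct <- (prob > t) != (class == 1)           # direct definition
identical(wrong_shift, wrong_direct)  # TRUE
```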
cv <- cv.glm(data = df, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2460733
So with 10-fold cross-validation, accuracy decreases from ~77% to ~75%
source("helper_functions.R")
library(magrittr)
library(GGally)
library(ggplot2)
library(stringr)
library(tidyr)
library(MASS)
library(knitr)
library(kableExtra)
library(corrplot)
library(plotly)
library(dplyr)
data("Boston")
df <- Boston %>% tbl_df() %>%
mutate_at(vars(chas), funs(as.factor))
glimpse(df)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The data has 14 variables; a complete description can be found here. In short, the dataset contains information on housing values in different suburbs of Boston, with multiple variables describing, for example, safety, housing density, and accessibility.
summaryKable(df) %>%
kable("html", align = "rrr", caption = "Data variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(width = "100%")
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | 0.006 | 0.000 | 0.460 | Levels | 0.385 | 3.561 | 2.900 | 1.130 | 1.000 | 187.000 | 12.600 | 0.320 | 1.730 | 5.000 |
| 1st Q | 0.082 | 0.000 | 5.190 | 0: 471 | 0.449 | 5.886 | 45.025 | 2.100 | 4.000 | 279.000 | 17.400 | 375.377 | 6.950 | 17.025 |
| Median | 0.257 | 0.000 | 9.690 | 1: 35 | 0.538 | 6.208 | 77.500 | 3.207 | 5.000 | 330.000 | 19.050 | 391.440 | 11.360 | 21.200 |
| Mean | 3.614 | 11.364 | 11.137 | – | 0.555 | 6.285 | 68.575 | 3.795 | 9.549 | 408.237 | 18.456 | 356.674 | 12.653 | 22.533 |
| 3rd Q | 3.677 | 12.500 | 18.100 | – | 0.624 | 6.623 | 94.075 | 5.188 | 24.000 | 666.000 | 20.200 | 396.225 | 16.955 | 25.000 |
| Max | 88.976 | 100.000 | 27.740 | – | 0.871 | 8.780 | 100.000 | 12.127 | 24.000 | 711.000 | 22.000 | 396.900 | 37.970 | 50.000 |
ggpairs(df,
title = "Study variable overview",
upper = list(continuous = wrap("cor", size = 2)),
lower = list(
continuous = wrap("points", alpha = .2, size = .3),
combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
angle = 90,
color = "black",
size = 6,
vjust = .5),
axis.text.y = element_text(color = "black", size = 6))
The variables are quite spread out and measured on very different scales, which motivates standardizing them before further analysis.
# scale() adds attributes to the columns, and ggpairs doesn't like them; as.vector strips them off
df %<>% mutate_at(vars(-chas), funs(scale)) %>%
mutate_at(vars(-chas), funs(as.vector))
df$crim <- ntile(df$crim, 4) %>%
factor(labels = c("low", "med-lo", "med-hi", "high"))
table(df$crim)
##
## low med-lo med-hi high
## 127 126 127 126
summaryKable(df) %>%
kable("html", align = "rrr", caption = "Data variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(width = "100%")
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Levels | -0.487 | -1.556 | Levels | -1.464 | -3.876 | -2.333 | -1.266 | -0.982 | -1.313 | -2.705 | -3.903 | -1.530 | -1.906 |
| 1st Q | low: 127 | -0.487 | -0.867 | 0: 471 | -0.912 | -0.568 | -0.837 | -0.805 | -0.637 | -0.767 | -0.488 | 0.205 | -0.799 | -0.599 |
| Median | med-lo: 126 | -0.487 | -0.211 | 1: 35 | -0.144 | -0.108 | 0.317 | -0.279 | -0.522 | -0.464 | 0.275 | 0.381 | -0.181 | -0.145 |
| Mean | med-hi: 127 | 0.000 | 0.000 | – | -0.000 | -0.000 | -0.000 | 0.000 | 0.000 | 0.000 | -0.000 | -0.000 | -0.000 | -0.000 |
| 3rd Q | high: 126 | 0.049 | 1.015 | – | 0.598 | 0.482 | 0.906 | 0.662 | 1.660 | 1.529 | 0.806 | 0.433 | 0.602 | 0.268 |
| Max | – | 3.800 | 2.420 | – | 2.730 | 3.552 | 1.116 | 3.957 | 1.660 | 1.796 | 1.637 | 0.441 | 3.545 | 2.987 |
ggpairs(df,
title = "Study variable overview",
upper = list(continuous = wrap("cor", size = 2)),
lower = list(
continuous = wrap("points", alpha = .2, size = .3),
combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
angle = 90,
color = "black",
size = 6,
vjust = .5),
axis.text.y = element_text(color = "black", size = 6))
Next, the data is split into a training set (80% of observations) and a test set (20%).
sample_ind <- sample(nrow(df), size = nrow(df) * 0.8)
train <- df[sample_ind,]
test <- df[-sample_ind,]
data.frame(dim(train),dim(test))
## dim.train. dim.test.
## 1 404 102
## 2 14 14
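Note that `sample()` without a fixed seed makes the split (and everything downstream) irreproducible; a sketch of a reproducible version (the seed value is arbitrary):

```r
set.seed(2017)  # arbitrary fixed seed, so the split can be reproduced
n <- 506        # nrow(df) for the Boston data
sample_ind <- sample(n, size = floor(n * 0.8))
length(sample_ind)        # 404 training rows
n - length(sample_ind)    # 102 test rows
```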
Linear discriminant analysis is then fitted on the training set, with the crime category as the target and all other variables as predictors.
lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
##
## Prior probabilities of groups:
## low med-lo med-hi high
## 0.2400990 0.2475248 0.2524752 0.2599010
##
## Group means:
## zn indus chas1 nox rm age
## low 0.9529012 -0.8896893 0.05154639 -0.8661481 0.4736410 -0.8700304
## med-lo -0.1311456 -0.2301746 0.08000000 -0.5352632 -0.1323258 -0.2550334
## med-hi -0.3855121 0.1812183 0.10784314 0.4430893 0.0218133 0.4633838
## high -0.4872402 1.0170492 0.05714286 1.0486443 -0.4640285 0.8377689
## dis rad tax ptratio black lstat
## low 0.8415520 -0.6858745 -0.7377618 -0.48422332 0.3833527 -0.78562223
## med-lo 0.2962802 -0.5534930 -0.4589325 -0.05521321 0.3077355 -0.06673563
## med-hi -0.4137543 -0.4121415 -0.3084902 -0.24664130 0.1013521 0.07195002
## high -0.8864350 1.6388211 1.5145512 0.78158339 -0.7526594 0.97144202
## medv
## low 0.56586367
## med-lo -0.01150431
## med-hi 0.11806120
## high -0.72356592
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12203679 0.737395966 -0.96785673
## indus 0.04207757 -0.143242138 0.36541807
## chas1 -0.29630930 -0.147975543 0.47291304
## nox 0.32248843 -0.765127374 -1.44932270
## rm -0.07166814 -0.090938011 -0.06328332
## age 0.17148793 -0.374608925 -0.07135382
## dis -0.14085933 -0.256964536 0.20909792
## rad 3.23274660 0.958910381 -0.15893648
## tax -0.07172543 -0.073756557 0.67377689
## ptratio 0.11467200 0.019166437 -0.36279840
## black -0.12511131 0.009428468 0.11139285
## lstat 0.28002815 -0.206961430 0.37504666
## medv 0.15703677 -0.343581532 -0.21201099
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9470 0.0392 0.0138
I made it 3D for fun
# plot the lda results
points <- data.frame(crim = train$crim,
lda = predict(lda.fit)$x)
levels(points$crim) %<>% str_to_title()
arrows <- coef(lda.fit) %>%
data.frame(., label = rownames(.)) %>% arrange(desc(abs(LD1))) %>%
mutate(LD1 = LD1*2.5, LD2 = LD2*2.5, LD3 = LD3*2.5, pos = 1) %>%
rbind(., mutate(., LD1=0, LD2=0, LD3=0, pos =0))
p1 <- plot_ly(arrows, x = ~LD1, y = ~LD2, z = ~LD3,
type = "scatter3d" , color = ~label, colors = rep(rgb(0, 0, 0), 13),
opacity = .5, mode = "lines", hoverinfo = "name", showlegend = FALSE,
line = list(width = 5))
p2 <- plot_ly(points, x = ~lda.LD1, y = ~lda.LD2, z = ~lda.LD3,
type = "scatter3d" , color = ~crim, opacity = .5, hoverinfo = "none",
mode = "markers", marker = list(size = 3, width = 2)) %>%
layout(title = "LDA",
scene = list(xaxis = list(title = "LDA1"),
yaxis = list(title = "LDA2"),
zaxis = list(title = "LDA3")))
subplot(p1, p2)
table("Crime" = test$crim,
"Prediction" = predict(lda.fit, newdata = test)$class)
## Prediction
## Crime low med-lo med-hi high
## low 17 11 2 0
## med-lo 6 17 3 0
## med-hi 1 6 16 2
## high 0 0 0 21
The model classifies the high crime category perfectly in this test set, while the lower categories are more often confused with their neighbors.
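Overall accuracy for a cross-tabulation like this is the sum of the diagonal over the total count; reproducing the counts printed above (note that the table itself depends on the random train/test split):

```r
# Confusion matrix printed above (rows = true class, cols = prediction)
tab <- matrix(c(17, 11,  2,  0,
                 6, 17,  3,  0,
                 1,  6, 16,  2,
                 0,  0,  0, 21), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # overall accuracy, 71/102 ~ 0.70
```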
data("Boston")
df <- Boston %>% tbl_df
df %<>% mutate_at(vars(-chas), funs(scale)) %>%
mutate_at(vars(-chas), funs(as.vector))
cbind(Bost = summary(dist(Boston)), df = summary(dist(df))) %>% t()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Bost 1.1190 85.620 170.500 226.300 371.900 626.00
## df 0.1343 3.266 4.612 4.727 5.957 13.88
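The summaries above show why standardization matters for distance-based methods: without it, variables on large scales dominate the Euclidean distance. A two-point sketch:

```r
# One variable on a large scale (like tax) swamps a small-scale one
x <- data.frame(big = c(200, 700), small = c(0.4, 0.9))
dist(x)         # ~500: driven almost entirely by 'big'
dist(scale(x))  # 2: after scaling, both variables contribute equally
```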
km <- kmeans(df, centers = 3)
twcss <- sapply(1:20, function(k) kmeans(df, k)$tot.withinss)
qplot(x = 1:20, y = twcss, geom = 'line')
Based on this elbow plot, the optimal number of centers is around 3, where the total within-cluster sum of squares stops decreasing sharply.
km <- kmeans(df, centers = 3)
p1 <- df %>% mutate(clust = factor(km$cluster)) %>%
gather(., key = "var", value = "value", -c(crim, clust)) %>%
ggplot(aes(value, crim, color = clust)) +
geom_point(shape = 1, size = 1.5, stroke = 1, alpha = .3) +
facet_wrap(~var, scales = "free_x")
p2 <- cor(df) %>% corrplot(method = "pie", type = "upper")
multiplot(p1, p2, cols = 2)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## ptratio black lstat medv
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046
## zn -0.3916785 0.17552032 -0.4129946 0.3604453
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252
## chas -0.1215152 0.04878848 -0.0539293 0.1752602
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208
## rm -0.3555015 0.12806864 -0.6138083 0.6953599
## age 0.2615150 -0.27353398 0.6023385 -0.3769546
## dis -0.2324705 0.29151167 -0.4969958 0.2499287
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867
## black -0.1773833 1.00000000 -0.3660869 0.3334608
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627
## medv -0.5077867 0.33346082 -0.7376627 1.0000000
Tax and rad stand out: they correlate strongly with each other (r = 0.91) and show the strongest positive correlations with crime rate.
# Add clusters of 3-centered kmeans to normalized boston dataset and remove variable crim
df %<>% mutate(clust = factor(km$cluster)) %>% select(-crim)
lda.clust <- lda(clust ~ ., data = df)
lda.clust
## Call:
## lda(clust ~ ., data = df)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3162055 0.4031621 0.2806324
##
## Group means:
## zn indus chas nox rm age
## 1 0.9417744 -0.951923132 0.03750000 -0.935011284 0.6866337 -1.0760835
## 2 -0.3994892 -0.001288196 0.09803922 0.008046344 -0.2499368 0.3103110
## 3 -0.4872402 1.074440092 0.06338028 1.041974304 -0.4146077 0.7666895
## dis rad tax ptratio black lstat
## 1 1.0201445 -0.5992881 -0.6788173 -0.53836625 0.3574835 -0.87226188
## 2 -0.2191145 -0.5748410 -0.5014715 -0.08995574 0.2490217 0.08575447
## 3 -0.8346743 1.5010821 1.4852884 0.73584205 -0.7605477 0.85963373
## medv
## 1 0.73518453
## 2 -0.09806608
## 3 -0.68749327
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn 0.08078910 0.37287165
## indus 0.31827624 -0.11659115
## chas 0.17733996 -0.51077583
## nox 0.13412889 0.13279542
## rm -0.13664074 0.46567725
## age 0.13086441 -0.90193834
## dis -0.21552999 0.51765442
## rad 1.96178033 0.64834390
## tax 1.09546464 0.57552090
## ptratio 0.13693307 -0.04730125
## black -0.15249319 -0.07384592
## lstat 0.13349513 0.17975971
## medv -0.01008698 0.31076349
##
## Proportion of trace:
## LD1 LD2
## 0.8991 0.1009
#Hidden
points <- data.frame(clust = df$clust,
lda = predict(lda.clust)$x)
names(points) <- c("Cluster", "LD1", "LD2")
arrows <- coef(lda.clust) %>%
data.frame(., label = str_to_upper(rownames(.))) %>%
arrange(desc(abs(LD1))) %>%
# Scale the arrows
mutate(LD1 = LD1*5, LD2 = LD2*5, pos = 1)
ggplot(points) +
theme_minimal() +
geom_point(aes(LD1, LD2, color = Cluster),
shape = 1, size = 2, stroke = 1.5, alpha = .75) +
geom_segment(data = arrows,
aes(y=0, x=0, yend=LD2, xend=LD1, alpha = .5),
arrow = arrow(length = unit(0.075, "inches")),
show.legend = F) +
# Adjust the labels .2 units away from the arrowhead
geom_text(data = arrows, aes(x = LD1+.2*(LD1/sqrt(LD1^2+LD2^2)),
y = LD2+.2*(LD2/sqrt(LD1^2+LD2^2)),
hjust = .5,
label = label),
show.legend = F)
Judging by the coefficients (and the arrow lengths in the plot), rad and tax are clearly the strongest separators of the clusters: both load heavily on LD1, which alone accounts for about 90 % of the trace. On LD2 the largest contributors are age, rad, and tax.
model_predictors <- dplyr::select(train, -crim) %>%
mutate(chas = as.numeric(chas))
# check the dimensions
data.frame(dim(model_predictors), dim(lda.fit$scaling))
## dim.model_predictors. dim.lda.fit.scaling.
## 1 404 13
## 2 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
head(matrix_product)
## LD1 LD2 LD3
## 1 -3.3428785 -1.2274482 -0.30268414
## 2 5.7432258 -0.2309955 0.03355818
## 3 -2.2818422 -0.5470764 0.73907670
## 4 -0.8856217 -3.2495587 -2.00746713
## 5 -3.4837339 -0.4942614 0.76053385
## 6 -2.6114779 -0.1154525 0.34281064
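The matrix product above is the core of LDA scoring: each row of raw predictors is projected onto the discriminant directions in `lda.fit$scaling`. A self-contained sketch of the same idea on the built-in `iris` data (standing in for the Boston objects above); note that `predict.lda` additionally centers the data before projecting, so the manual product differs from `predict()$x` only by a constant column-wise shift:

```r
# Sketch (iris, not the Boston objects above): projecting raw predictors
# onto the LDA scaling with a plain matrix product
library(MASS)

fit <- lda(Species ~ ., data = iris)
X   <- as.matrix(iris[, 1:4])

# Manual projection: n x p predictors times p x (k-1) scaling matrix
manual <- X %*% fit$scaling

# predict() centers the data first, so the manual scores differ from
# predict(fit)$x only by a constant offset per column
auto <- predict(fit)$x
head(manual - auto)  # each column should be (near-)constant
```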
#Hidden
p1 <- plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3,
type= 'scatter3d', mode='markers', color = train$crim,
marker = list(size = 3, width = 2), scene = "scene1")
# K-means was fitted on all observations, so subset the cluster assignments to the training rows only
train_km <- as.data.frame(km$cluster)[sample_ind,] %>% as.factor()
p2 <- plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3,
type= 'scatter3d', mode='markers', color = train_km,
marker = list(size = 3, width = 2), scene = "scene2")
subplot(p1,p2) %>%
layout(scene1 = list(domain=list(x=c(0,0.5),y=c(0,1)),
xaxis = list(title = "LD1"),
yaxis = list(title = "LD2"),
zaxis = list(title = "LD3")),
scene2 = list(domain=list(x=c(0.5,1),y=c(0,1)),
xaxis = list(title = "LD1"),
yaxis = list(title = "LD2"),
zaxis = list(title = "LD3")))
The most obvious difference is, of course, that the first LDA plot has four classes whereas the k-means plot has only three. Otherwise the clusterings are fairly similar: both models easily distinguish the group with strongly positive LD1 values. Judging visually, k-means seems to achieve better separation within the other, larger cluster.
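Beyond eyeballing the 3D plots, the agreement between two partitions can be quantified with a simple cross-tabulation. A minimal self-contained sketch on `iris` (where the known species stand in for the crime classes used above):

```r
# Sketch: cross-tabulate k-means clusters against a known grouping to see
# how well the two partitions line up (iris$Species stands in for the
# crime classes used above)
set.seed(123)
km_iris <- kmeans(scale(iris[, 1:4]), centers = 3)

# Rows: k-means clusters, columns: reference classes; a near-diagonal
# table (after relabeling the clusters) indicates good agreement
table(km_iris$cluster, iris$Species)
```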
Code for data wrangling can be found here
#Hidden
source("helper_functions.R")
library(magrittr)
library(GGally)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(tidyr)
library(MASS)
library(dplyr)
df <- read.csv(file = "data/human.csv", row.names = 1)
glimpse(df)
## Observations: 155
## Variables: 8
## $ eduRatio <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.96...
## $ labRatio <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.82...
## $ expEdu <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, ...
## $ lifeExp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, ...
## $ GNIC <dbl> 64.992, 42.261, 56.431, 44.025, 45.435, 43.919, ...
## $ matMort <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, ...
## $ adolBirthRate <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, ...
## $ reprParl <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, ...
The dataset contains information on human development characteristics. It totals 155 observations (countries) and 8 variables.
#Hidden
df %>% summaryKable() %>%
kable("html", align = "rrr", caption = "Data variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed")) %>%
scroll_box(width = "100%")
| | eduRatio | labRatio | expEdu | lifeExp | GNIC | matMort | adolBirthRate | reprParl |
|---|---|---|---|---|---|---|---|---|
| Min | 0.172 | 0.186 | 5.400 | 49.000 | 1.123 | 1.000 | 0.600 | 0.000 |
| 1st Q | 0.726 | 0.598 | 11.250 | 66.300 | 5.275 | 11.500 | 12.650 | 12.400 |
| Median | 0.938 | 0.753 | 13.500 | 74.200 | 13.054 | 49.000 | 33.600 | 19.300 |
| Mean | 0.853 | 0.707 | 13.176 | 71.654 | 46.496 | 149.084 | 47.159 | 20.912 |
| 3rd Q | 0.997 | 0.854 | 15.200 | 77.250 | 27.256 | 190.000 | 71.950 | 27.950 |
| Max | 1.497 | 1.038 | 20.200 | 83.500 | 908.000 | 1100.000 | 204.800 | 57.500 |
# Hidden
ggpairs(df,
title = "Study variable overview",
upper = list(continuous = wrap("cor", size = 3)),
lower = list(
continuous = wrap("points", alpha = .2, size = .4),
combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
angle = 90,
color = "black",
size = 6,
vjust = .5),
axis.text.y = element_text(color = "black", size = 6))
Many of the variables are roughly normally distributed, but eduRatio, GNIC, and matMort are clearly skewed, so problems might arise if normality is assumed.
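A common remedy for right-skewed variables such as GNIC and matMort is a log transform. A quick sketch on simulated skewed data (not the actual GNIC values):

```r
# Sketch: a log transform pulls in the long right tail of a skewed variable
set.seed(1)
gni <- rlnorm(500, meanlog = 9, sdlog = 1)  # simulated right-skewed "income"

# On the raw scale the mean sits well above the median (right skew);
# after log() the two are close to each other
c(mean = mean(gni), median = median(gni))
c(mean = mean(log(gni)), median = median(log(gni)))

par(mfrow = c(1, 2))
hist(gni, main = "raw"); hist(log(gni), main = "log")
```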
#Hidden
cor(df, use = "pairwise.complete.obs") %>%
corrplot::corrplot(method = "number", type = "upper",
tl.pos = "td", tl.srt = 25,
tl.offset = 1, tl.cex = .7,
number.cex = .75,
title = "Correlation between study variables",
mar=c(0,0,1,0),
diag = F)
According to the plot above, expected education (expEdu) correlates strongly and positively with life expectancy, and negatively with maternal mortality and adolescent birth rate. Maternal mortality in turn correlates strongly and negatively with the education ratio, life expectancy, and expected education.
pca <- prcomp(df)
pca %>% summary()
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 227.566 121.2798 26.48071 11.47526 3.89773 1.60859
## Proportion of Variance 0.769 0.2184 0.01041 0.00196 0.00023 0.00004
## Cumulative Proportion 0.769 0.9874 0.99778 0.99974 0.99996 1.00000
## PC7 PC8
## Standard deviation 0.1915 0.1598
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
#Hidden
biplot(pca, choices = 1:2)
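The unscaled PCA above is dominated by GNIC and matMort simply because they have the largest variances, which is why PC1 alone claims 77 % of the variance. A small self-contained sketch of why standardizing first matters:

```r
# Sketch: PCA on unscaled data is driven by whichever variable has the
# largest variance, regardless of any actual structure
set.seed(42)
x <- data.frame(small = rnorm(100, sd = 1),
                big   = rnorm(100, sd = 100))

# Unscaled: PC1 is essentially just 'big'
round(prcomp(x)$rotation, 2)

# Scaled: both variables contribute comparably
round(prcomp(x, scale. = TRUE)$rotation, 2)
```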
df_scaled <- scale(df)
pca <- prcomp(df_scaled)
pca %>% summary
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.9970 1.1465 0.9439 0.85645 0.69002 0.53246
## Proportion of Variance 0.4985 0.1643 0.1114 0.09169 0.05952 0.03544
## Cumulative Proportion 0.4985 0.6628 0.7742 0.86585 0.92536 0.96080
## PC7 PC8
## Standard deviation 0.46511 0.31184
## Proportion of Variance 0.02704 0.01216
## Cumulative Proportion 0.98784 1.00000
#Hidden
biplot(pca, choices = 1:2)
library(FactoMineR)
data(tea)
df <- tea
mca <- df %>% select(breakfast, friends, resto, Tea, Sport) %>%
MCA(graph = FALSE)
mca %>% summary
##
## Call:
## MCA(X = ., graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.268 0.235 0.213 0.174 0.160 0.149
## % of var. 22.370 19.550 17.778 14.522 13.358 12.421
## Cumulative % of var. 22.370 41.920 59.698 74.221 87.579 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.529 0.348 0.198 | -0.982 1.371 0.684 | 0.302
## 2 | 0.529 0.348 0.198 | -0.982 1.371 0.684 | 0.302
## 3 | -0.753 0.704 0.517 | 0.256 0.093 0.060 | -0.242
## 4 | 0.367 0.167 0.130 | 0.372 0.197 0.133 | 0.220
## 5 | 0.361 0.162 0.143 | -0.618 0.543 0.419 | -0.446
## 6 | 0.384 0.183 0.167 | -0.007 0.000 0.000 | -0.351
## 7 | -0.150 0.028 0.035 | -0.303 0.131 0.144 | -0.600
## 8 | 0.552 0.378 0.221 | -0.372 0.196 0.100 | 0.397
## 9 | 0.361 0.162 0.143 | -0.618 0.543 0.419 | -0.446
## 10 | 0.513 0.326 0.167 | -0.603 0.517 0.231 | 0.873
## ctr cos2
## 1 0.143 0.065 |
## 2 0.143 0.065 |
## 3 0.091 0.053 |
## 4 0.075 0.046 |
## 5 0.310 0.218 |
## 6 0.193 0.140 |
## 7 0.563 0.563 |
## 8 0.246 0.114 |
## 9 0.310 0.218 |
## 10 1.190 0.485 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | -0.031 0.034 0.001 -0.515 | -0.769 24.200 0.546
## Not.breakfast | 0.029 0.032 0.001 0.515 | 0.710 22.339 0.546
## friends | -0.459 10.250 0.397 -10.893 | 0.264 3.885 0.131
## Not.friends | 0.865 19.318 0.397 10.893 | -0.498 7.322 0.131
## Not.resto | 0.427 9.993 0.509 12.341 | 0.032 0.066 0.003
## resto | -1.194 27.955 0.509 -12.341 | -0.091 0.185 0.003
## black | 0.093 0.158 0.003 0.918 | -0.774 12.612 0.196
## Earl Grey | -0.343 5.654 0.213 -7.976 | 0.108 0.641 0.021
## green | 1.801 26.573 0.401 10.946 | 1.104 11.439 0.151
## Not.sportsman | -0.025 0.019 0.000 -0.362 | 0.548 10.329 0.203
## v.test Dim.3 ctr cos2 v.test
## breakfast -12.776 | -0.114 0.580 0.012 -1.886 |
## Not.breakfast 12.776 | 0.105 0.536 0.012 1.886 |
## friends 6.270 | -0.124 0.938 0.029 -2.937 |
## Not.friends -6.270 | 0.233 1.767 0.029 2.937 |
## Not.resto 0.938 | -0.161 1.782 0.072 -4.646 |
## resto -0.938 | 0.449 4.985 0.072 4.646 |
## black -7.663 | 1.270 37.305 0.528 12.567 |
## Earl Grey 2.510 | -0.457 12.585 0.376 -10.608 |
## green 6.714 | -0.177 0.322 0.004 -1.073 |
## Not.sportsman 7.792 | 0.787 23.390 0.418 11.182 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.001 0.546 0.012 |
## friends | 0.397 0.131 0.029 |
## resto | 0.509 0.003 0.072 |
## Tea | 0.435 0.290 0.536 |
## Sport | 0.000 0.203 0.418 |
#Hidden
mca %>% plot(invisible=c("ind"))
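The category coordinates shown in the factor map can also be pulled out of the MCA object directly, which is handy for custom plots or tables. A minimal sketch with FactoMineR's `tea` data:

```r
# Sketch: extract eigenvalues and category coordinates from an MCA fit
library(FactoMineR)
data(tea)

mca_tea <- MCA(tea[, c("breakfast", "friends", "resto", "Tea", "Sport")],
               graph = FALSE)

mca_tea$eig              # variance explained per dimension
head(mca_tea$var$coord)  # category coordinates used in the factor map
```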